Possibly needed in the ChatGPT era!? File parsing with Python (PowerPoint edition)

We'll parse that PowerPoint of yours

#Python

#Microsoft PowerPoint

#ChatGPT

#LlamaIndex

#LangChain

nokomoro3

2023.04.15

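Hello.

I'm Nakamura from the Machine Learning Team, Integration Department, Data Analytics Division.

In this post I look at file parsing, which is needed when supplying context to the much-discussed ChatGPT.

I focus on PowerPoint and also examine how existing libraries implement it.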

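Implementations in prior work

As prior work, I look at the following frequently discussed libraries.

(LlamaIndex and LlamaHub are almost the same, but some parsers exist in only one of them, so both are listed.)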

LlamaIndex
- https://github.com/jerryjliu/llama_index
- https://gpt-index.readthedocs.io/en/latest/index.html
LlamaHub
- https://github.com/emptycrown/llama-hub
- https://llamahub.ai/
LangChain
- https://github.com/hwchase17/langchain
- https://python.langchain.com/en/latest/index.html
chat-gpt-retrieval-plugin
- https://github.com/openai/chatgpt-retrieval-plugin

LlamaIndex

In LlamaIndex, the parser is implemented in slides_parser.py.

It appears to depend on the following libraries.

python-pptx
torch
transformers
Pillow

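Parsing is basically done with python-pptx, but torch, transformers, and Pillow are also required in order to run the image captioning model that attaches captions to images.

Note, however, that caption generation supports English only.

An excerpt of the code follows.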

presentation = Presentation(file)
result = ""
for i, slide in enumerate(presentation.slides):
    result += f"\n\nSlide #{i}: \n"
    for shape in slide.shapes:
        if hasattr(shape, "image"):
            image = shape.image
            # get image "file" contents
            image_bytes = image.blob
            # temporarily save the image to feed into model
            image_filename = f"tmp_image.{image.ext}"
            with open(image_filename, "wb") as f:
                f.write(image_bytes)
            result += f"\n Image: {self.caption_image(image_filename)}\n\n"

            os.remove(image_filename)
        if hasattr(shape, "text"):
            result += f"{shape.text}\n"

Incidentally, python-pptx appears not to have been updated since September 2021 (as of this writing).

LlamaHub

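LlamaHub's implementation is the same as LlamaIndex's, so I omit the details; the implementation can be found in the LlamaHub repository.

One difference is that caption_images=False is the default. To get the same behavior as LlamaIndex, you therefore need to set caption_images=True.

There may be other differences in the details, but I do not cover them here.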

LangChain

In LangChain, the implementation can be found in the LangChain repository.

It appears to depend on the following libraries.

unstructured
magic

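magic is used only to determine the file type, so unstructured does most of the work.

I was not very familiar with the unstructured library, but it depends on LibreOffice and python-pptx and appears to be under active development.

LibreOffice is used to convert ppt to pptx.

As shown below, the main part is parsing with python-pptx.

An excerpt of the implementation where python-pptx is used follows.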

  if filename is not None:
      presentation = pptx.Presentation(filename)
  elif file is not None:
      presentation = pptx.Presentation(file)

  elements: List[Element] = []
  metadata_filename = metadata_filename or filename
  metadata = ElementMetadata(filename=metadata_filename)
  num_slides = len(presentation.slides)
  for i, slide in enumerate(presentation.slides):
      metadata.page_number = i + 1

      for shape in _order_shapes(slide.shapes):
          # NOTE(robinson) - we don't deal with tables yet, but so future humans can find
          # it again, here are docs on how to deal with tables. The check for tables should
          # be `if shape.has_table`
          # ref: https://python-pptx.readthedocs.io/en/latest/user/table.html#adding-a-table
          if not shape.has_text_frame:
              continue
          # NOTE(robinson) - avoid processing shapes that are not on the actual slide
          if shape.top < 0 or shape.left < 0:
              continue
          for paragraph in shape.text_frame.paragraphs:
              text = paragraph.text
              if text.strip() == "":
                  continue
              if _is_bulleted_paragraph(paragraph):
                  elements.append(ListItem(text=text, metadata=metadata))
              elif is_possible_narrative_text(text):
                  elements.append(NarrativeText(text=text, metadata=metadata))
              elif is_possible_title(text):
                  elements.append(Title(text=text, metadata=metadata))
              else:
                  elements.append(Text(text=text, metadata=metadata))

      if include_page_breaks and i < num_slides - 1:
          elements.append(PageBreak())

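Beyond simple text extraction, a distinctive feature is that it tries to estimate whether each piece of text is NarrativeText, a title, a bulleted item, and so on.

Care is also taken over the order in which shapes are processed: _order_shapes sorts them so that the top-left comes first.

(Conversely, in the other libraries the top-left shape does not necessarily come first.)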

chat-gpt-retrieval-plugin

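The implementation can be found in the chatgpt-retrieval-plugin repository.

It appears to depend on the following library.

python-pptx

An excerpt of the code follows.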

presentation = pptx.Presentation(file)
for slide in presentation.slides:
    for shape in slide.shapes:
        if shape.has_text_frame:
            for paragraph in shape.text_frame.paragraphs:
                for run in paragraph.runs:
                    extracted_text += run.text + " "
            extracted_text += "\n"

It appears to be a simple implementation that only extracts text.
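As a small illustration of what run-level extraction implies, the sketch below compares joining every run with a trailing space (as in the excerpt above) against joining runs without a separator, one line per paragraph. The objects are hypothetical stand-ins that only mimic paragraph.runs and run.text, so the sketch runs without python-pptx; the function names are my own.

```python
from types import SimpleNamespace


def text_by_runs(paragraphs) -> str:
    """Join every run with a trailing space, then end the shape with a newline."""
    out = ""
    for paragraph in paragraphs:
        for run in paragraph.runs:
            out += run.text + " "
    return out + "\n"


def text_by_paragraphs(paragraphs) -> str:
    """Join run text without separators and emit one line per paragraph."""
    return "".join(
        "".join(run.text for run in paragraph.runs) + "\n"
        for paragraph in paragraphs
    )


def make_paragraph(*texts):
    """Build a stand-in paragraph (hypothetical, not a python-pptx object)."""
    return SimpleNamespace(runs=[SimpleNamespace(text=t) for t in texts])


# A word split into two runs (e.g. by a formatting change) plus a second paragraph.
paragraphs = [make_paragraph("Py", "thon"), make_paragraph("parser")]
print(repr(text_by_runs(paragraphs)))
print(repr(text_by_paragraphs(paragraphs)))
```

With run-level joining, a word that is split across runs gains a space in the middle and paragraph boundaries inside a shape are lost, which may or may not matter depending on how the text is used afterwards.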

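Summary of the existing library implementations

In practice, python-pptx looks like the most widely used option.

How python-pptx is used varies, from simply converting the text as it is to distinguishing individual parts, with a few refinements along the way.

LangChain takes care over parsing order, and LlamaIndex converts images to captions and treats them as text; these features are unique to each, and I got the impression that extra thought went into them.

Implementing it my own way

I implement a parser based on the prior work.

Straight to the conclusion

Here is what I ended up with.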

import pptx
from pptx.presentation import Presentation
from pptx.table import Table
from pptx.text.text import TextFrame
from pptx.shapes.autoshape import Shape
from pptx.slide import Slide
from pptx.parts.image import Image as PptxImage
from pptx.shapes.shapetree import SlideShapes

import magic

import typing as T
from dataclasses import dataclass
import subprocess
from subprocess import CompletedProcess
import pathlib
import tempfile

@dataclass
class ParsedShape():
    shape_type: str  # one of "text", "table", "image"
    text: str
    left: int
    top: int
    width: int
    height: int

@dataclass
class ParsedTextShape(ParsedShape):
    pass

@dataclass
class ParsedTableShape(ParsedShape):
    row_num: int
    col_num: int

@dataclass
class ParsedImageShape(ParsedShape):
    pass

@dataclass
class ParsedSlide():
    slide_number: int
    shapes: T.List[ParsedShape]

@dataclass
class ParsedPresentation():
    filename: str
    caption_images: bool
    num_slides: int
    slide_width: int
    slide_height: int
    slides: T.List[ParsedSlide]
    
class PowerPointParser():
    def __init__(self, caption_images: bool = False):
        """Initialize the parser.

        Args:
            caption_images (bool, optional): Whether to convert images to captions. Defaults to False.
        """

        self.caption_images = caption_images

        # Load the captioning model only when caption conversion is enabled
        if self.caption_images:

            from transformers import (
                AutoTokenizer,
                VisionEncoderDecoderModel,
                ViTImageProcessor,
            )

            self.model = VisionEncoderDecoderModel.from_pretrained(
                "nlpconnect/vit-gpt2-image-captioning"
            )
            self.feature_extractor = ViTImageProcessor.from_pretrained(
                "nlpconnect/vit-gpt2-image-captioning"
            )
            self.tokenizer = AutoTokenizer.from_pretrained(
                "nlpconnect/vit-gpt2-image-captioning"
            )

    def parse(self, filename: str) -> ParsedPresentation:
        """Run the parser.

        Args:
            filename (str): Input file name

        Raises:
            ValueError: If the file is not a PowerPoint file

        Returns:
            ParsedPresentation: Parse result
        """
                    
        mime_type = magic.from_file(filename, mime=True)

        if mime_type == "application/vnd.ms-powerpoint":
            return self.__parse_ppt(filename)
        elif mime_type == "application/vnd.openxmlformats-officedocument.presentationml.presentation":
            return self.__parse_pptx(filename)
    
        # Reject anything other than PowerPoint
        raise ValueError(
            f"Invalid mime type: {mime_type}."
        )

    def __parse_ppt(self, filename: str) -> ParsedPresentation:
        """Parse a PowerPoint file in the legacy (.ppt) format.

        Args:
            filename (str): Input file name

        Raises:
            ValueError: If the LibreOffice conversion fails

        Returns:
            ParsedPresentation: Parse result
        """

        with tempfile.TemporaryDirectory() as tmpdir:

            # Convert to pptx with LibreOffice and save the converted file in the temporary directory
            cp: CompletedProcess[bytes] = subprocess.run(
                ['soffice', '--headless'
                    , '--convert-to', 'pptx', '--outdir', tmpdir
                    , str(pathlib.Path(filename))
                ], stdout=subprocess.PIPE, stderr=subprocess.PIPE,
            )
            if cp.returncode != 0:
                raise ValueError(
                    f"libreoffice failed: code={cp.returncode}, stderr={cp.stderr.decode()}"
                )
            
            converted_file = pathlib.Path(tmpdir).joinpath(
                pathlib.Path(filename).with_suffix(".pptx").name
            )
        
            return self.__parse_pptx(str(converted_file))

    def __parse_pptx(self, filename: str) -> ParsedPresentation:
        """Parse a PowerPoint (.pptx) file.

        Args:
            filename (str): Input file name

        Returns:
            ParsedPresentation: Parse result
        """

        presentation: Presentation = pptx.Presentation(filename)

        slide_width: int = presentation.slide_width
        slide_height: int = presentation.slide_height

        slides: T.List[ParsedSlide] = []

        # Loop over slides
        for slide_number, slide in enumerate(presentation.slides):

            slide: Slide = slide

            shapes: T.List[ParsedShape] = []

            # Loop over shapes
            for shape in self.__order_shapes(slide.shapes):

                # Shape objects that have text
                if shape.has_text_frame:
                    text = self.__text_frame_to_text(shape.text_frame)
                    shapes.append(
                        ParsedTextShape(shape_type="text", text=text
                            , left=shape.left, top=shape.top, width=shape.width, height=shape.height)
                    )

                # Shape objects that have a table
                if shape.has_table:
                    row_num = len(shape.table.rows)
                    col_num = len(shape.table.columns)
                    text = self.__table_to_text(table=shape.table, col_num=col_num)
                    shapes.append(
                        ParsedTableShape(shape_type="table", text=text
                            , left=shape.left, top=shape.top, width=shape.width, height=shape.height
                            , row_num=row_num, col_num=col_num)
                    )

                # Shape objects that have an image file
                if hasattr(shape, "image") and self.caption_images:
                    text = self.__image_to_text(shape.image)
                    shapes.append(
                        ParsedImageShape(shape_type="image", text=text
                            , left=shape.left, top=shape.top, width=shape.width, height=shape.height)
                    )

            slides.append(
                ParsedSlide(slide_number=slide_number, shapes=shapes)
            )

        # Bundle everything into a dataclass
        result = ParsedPresentation(
            filename=filename
            , caption_images=self.caption_images
            , num_slides=len(presentation.slides)
            , slide_width=slide_width
            , slide_height=slide_height
            , slides=slides
        )

        return result

    def __text_frame_to_text(self, text_frame: TextFrame) -> str:
        """Parse a TextFrame.

        Args:
            text_frame (TextFrame): TextFrame inside the shape

        Returns:
            str: Extracted text
        """
        text_concat = ""
        for paragraph in text_frame.paragraphs:
            text = paragraph.text
            if text.strip() == "":
                continue
            text_concat += f"{text}\n"
        return text_concat
    
    def __table_to_text(self, table: Table, col_num: int) -> str:
        """Parse a table.

        Args:
            table (Table): Table inside the shape
            col_num (int): Number of table columns (used to detect the end of a row)

        Returns:
            str: Extracted text
        """

        table_text_concat = ""
        for cell_index, cell in enumerate(table.iter_cells()):
            text = self.__text_frame_to_text(cell.text_frame).rstrip("\n")
            table_text_concat += text.replace("\n", " ")
            if cell_index % col_num == col_num - 1:
                table_text_concat += "\n"
            else:
                table_text_concat += ","
        return table_text_concat

    def __image_to_text(self, image: PptxImage) -> str:
        """Parse an image file (English captions only).

        Args:
            image (PptxImage): Image object

        Returns:
            str: Extracted text
        """

        # get image "file" contents
        image_bytes: bytes = image.blob
        
        # temporarily save the image to feed into model
        image_filename = f"tmp_image.{image.ext}"

        with tempfile.TemporaryDirectory() as tmpdir:
            image_path = pathlib.Path(tmpdir).joinpath(image_filename)
            with open(image_path, "wb") as f:
                f.write(image_bytes)
        
            text = self.__caption_image(str(image_path))

        return text

    def __caption_image(self, tmp_image_file: str) -> str:
        """Generate text caption of image.

        Args:
            tmp_image_file (str): Path where the image file is saved

        Returns:
            str: Generated caption
        """

        import torch
        from PIL import Image

        device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model.to(device)

        max_length = 16
        num_beams = 4
        gen_kwargs = {"max_length": max_length, "num_beams": num_beams}

        i_image = Image.open(tmp_image_file)
        if i_image.mode != "RGB":
            i_image = i_image.convert(mode="RGB")

        pixel_values = self.feature_extractor(
            images=[i_image], return_tensors="pt"
        ).pixel_values
        pixel_values = pixel_values.to(device)

        output_ids = self.model.generate(pixel_values, **gen_kwargs)

        preds = self.tokenizer.batch_decode(output_ids, skip_special_tokens=True)
        return preds[0].strip()

    def __order_shapes(self, shapes: SlideShapes) -> T.List[T.Any]:
        """Orders the shapes from top to bottom and left to right.

        Args:
            shapes (SlideShapes): SlideShapes before sorting

        Returns:
            T.List[T.Any]: Shapes sorted by position (sorted() returns a list)
        """
        return sorted(shapes, key=lambda x: (x.top, x.left))

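Exception handling may not be sufficient, so please keep that in mind. The design policy and an explanation of the code are given in the following sections.

I used WSL to check the behavior. It should work the same way in any Linux environment.

libmagic1 and libreoffice are required, so run the following and install them if they are not present.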

apt list --installed | grep -e "libmagic1" -e "libreoffice/"

libmagic1/focal,now 1:5.38-4 amd64 [installed,automatic]
libreoffice/focal-updates,focal-security,now 1:6.4.7-0ubuntu0.20.04.6 amd64 [installed]
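The same kind of check can also be done from Python before attempting a .ppt conversion. The following is a minimal sketch using shutil.which from the standard library; "soffice" is the command the parser above invokes, while "libreoffice" as an alternative command name and the helper name find_executable are assumptions of mine.

```python
import shutil
from typing import List, Optional


def find_executable(candidates: List[str]) -> Optional[str]:
    """Return the path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None


# "libreoffice" is an assumed alternative name; adjust to your environment.
soffice_path = find_executable(["soffice", "libreoffice"])
if soffice_path is None:
    print("LibreOffice not found: .ppt files cannot be converted")
else:
    print(f"LibreOffice found at {soffice_path}")
```

Checking up front gives a clearer message than the FileNotFoundError that subprocess.run raises when the command is missing.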

The Python packages used are as follows.

pillow==9.5.0
python-magic==0.4.27
python-pptx==0.6.21
torch==2.0.0
transformers==4.27.4

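This exact set of versions is not required, but if any package is missing, install it with pip or similar.

Design policy

The main policies are as follows.

Return the result as dataclasses, making conversion to JSON (dict) easy
- On that basis, the result can later be converted freely into whatever text format is wanted
Collect metadata such as layout and page numbers as well, with uses beyond text extraction in mind
- For example, the page number that supports an answer may be wanted later
Extract tables, which the prior implementations do not handle
Follow LlamaIndex's caption conversion as-is for images
Follow LangChain for parsing order, so that the top-left comes first
Like LangChain, convert ppt to pptx with LibreOffice first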
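As one possible use of the layout metadata, the sketch below converts positions to inches and to fractions of the slide size. python-pptx reports lengths in EMU (English Metric Units, 914400 per inch); the Box class and the helper names are hypothetical stand-ins for the position fields of ParsedShape, and the numbers are sample values for a 16:9 slide.

```python
from dataclasses import dataclass
from typing import Tuple

EMU_PER_INCH = 914400  # English Metric Units per inch


@dataclass
class Box:
    """Hypothetical stand-in for the position fields of ParsedShape."""
    left: int
    top: int
    width: int
    height: int


def emu_to_inches(emu: int) -> float:
    """Convert a length in EMU to inches."""
    return emu / EMU_PER_INCH


def relative_center(box: Box, slide_width: int, slide_height: int) -> Tuple[float, float]:
    """Return the shape center as fractions (0..1) of the slide size."""
    cx = (box.left + box.width / 2) / slide_width
    cy = (box.top + box.height / 2) / slide_height
    return cx, cy


# Sample values for a 16:9 slide: 12192000 x 6858000 EMU.
slide_w, slide_h = 12192000, 6858000
title = Box(left=838200, top=3618465, width=10515599, height=602554)
print(emu_to_inches(slide_w), emu_to_inches(slide_h))
print(relative_center(title, slide_w, slide_h))
```

Relative coordinates make it easier to write rules such as "shapes in the top tenth of the slide are probably titles" without depending on the slide size.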

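Code walkthrough

The code does not differ much from the prior implementations, but I explain it here as a walkthrough.

Hierarchy when parsing pptx

A pptx file has the following hierarchical structure.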

pptx.presentation.Presentation object
├ pptx.slide.Slide object
│  ├ pptx.shapes.autoshape.Shape object
│  └ pptx.shapes.autoshape.Shape object
├ pptx.slide.Slide object
│  ├ pptx.shapes.autoshape.Shape object
│  └ pptx.shapes.autoshape.Shape object
~(rest omitted)~

The Shape objects above correspond to the parts within a slide.
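The hierarchy can be walked with two nested loops. The sketch below is duck-typed: it only relies on presentation.slides, slide.shapes, and shape.name, and the deck here is a hypothetical stand-in built with SimpleNamespace rather than a real python-pptx Presentation. Unlike the parser above, it numbers slides from 1.

```python
from types import SimpleNamespace
from typing import Iterator, Tuple


def walk_shapes(presentation) -> Iterator[Tuple[int, str]]:
    """Yield (1-based slide number, shape name) for every shape in the deck."""
    for slide_number, slide in enumerate(presentation.slides, start=1):
        for shape in slide.shapes:
            yield slide_number, shape.name


# Hypothetical stand-in that mimics the Presentation -> Slide -> Shape hierarchy.
deck = SimpleNamespace(slides=[
    SimpleNamespace(shapes=[
        SimpleNamespace(name="Title 1"),
        SimpleNamespace(name="Picture 2"),
    ]),
    SimpleNamespace(shapes=[SimpleNamespace(name="Table 3")]),
])

for slide_number, name in walk_shapes(deck):
    print(slide_number, name)
```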

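About Shape objects

Each part on a slide is represented by a shape class specific to its kind (for example pptx.shapes.autoshape.Shape, Picture, or GraphicFrame), all of which derive from pptx.shapes.base.BaseShape.

Here I do not branch on every individual kind of Shape object; instead, Shape objects are handled as three broad types, as shown below.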

                # Shape objects that have text
                if shape.has_text_frame:
                    # Handle shapes that have text

                # Shape objects that have a table
                if shape.has_table:
                    # Handle shapes that have a table

                # Shape objects that have an image file
                if hasattr(shape, "image") and self.caption_images:
                    # Handle shapes that have an image

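A Shape object also stores a shape_type, which can be used to determine its exact kind.

See the MSO_SHAPE_TYPE enumeration in python-pptx for the meaning of each type.

For reference, the kinds of Shape objects I was able to confirm while writing this post were the following.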

EnumMember("PLACEHOLDER", 14, "Placeholder"),
EnumMember("TEXT_BOX", 17, "Text box"),
EnumMember("PICTURE", 13, "Picture"),
EnumMember("AUTO_SHAPE", 1, "AutoShape"),
EnumMember("LINE", 9, "Line"),
EnumMember("TABLE", 19, "Table"),

The attributes available differ by shape_type, so when adding finer-grained handling, list the attributes as shown below and check them.

# For reference: snippet to list a shape's attributes
attributes = dir(shape)
attributes = [attr for attr in attributes if attr[0] != "_"]
print(f"{attributes=}")
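Before adding type-specific handling, it can help to survey which shape types a deck actually contains. The sketch below counts shapes by type name. ShapeType is a hypothetical stand-in whose six values are copied from the list above (in real code, python-pptx's own enumeration would be used instead), and the shapes are stand-in objects carrying only a shape_type.

```python
from collections import Counter
from enum import IntEnum
from types import SimpleNamespace


class ShapeType(IntEnum):
    """Stand-in enum; values copied from the six members listed above."""
    AUTO_SHAPE = 1
    LINE = 9
    PICTURE = 13
    PLACEHOLDER = 14
    TEXT_BOX = 17
    TABLE = 19


def count_shape_types(shapes) -> Counter:
    """Count shapes by the name of their shape_type.

    Raises ValueError for a value outside the six members defined here.
    """
    return Counter(ShapeType(shape.shape_type).name for shape in shapes)


# Hypothetical stand-in shapes.
shapes = [SimpleNamespace(shape_type=v) for v in (14, 14, 13, 17, 19)]
print(count_shape_types(shapes))
```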

Storage order of Shape objects

The shapes attribute of a Slide does not store Shape objects in top-to-bottom order.

So, as in the LangChain implementation, I sort the shapes by their position.

            for shape in self.__order_shapes(slide.shapes):
                # Process each Shape object

    def __order_shapes(self, shapes: SlideShapes) -> T.List[T.Any]:
        """Orders the shapes from top to bottom and left to right.

        Args:
            shapes (SlideShapes): SlideShapes before sorting

        Returns:
            T.List[T.Any]: Shapes sorted by position (sorted() returns a list)
        """
        return sorted(shapes, key=lambda x: (x.top, x.left))
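A slightly more defensive variant of the same sort is sketched below, under the assumption that a shape's position may occasionally be missing (None), which would make the tuple comparison fail. The shapes are hypothetical stand-ins that carry only name, top, and left.

```python
from types import SimpleNamespace
from typing import Any, List


def order_shapes(shapes) -> List[Any]:
    """Sort top-to-bottom, then left-to-right; missing positions sort as 0."""
    return sorted(shapes, key=lambda s: (s.top or 0, s.left or 0))


# Hypothetical stand-ins in an arbitrary storage order.
shapes = [
    SimpleNamespace(name="body", top=1148275, left=373025),
    SimpleNamespace(name="page_no", top=0, left=10748210),
    SimpleNamespace(name="title", top=0, left=0),
    SimpleNamespace(name="no_position", top=None, left=None),
]
print([s.name for s in order_shapes(shapes)])
```

Because sorted() is stable, shapes with identical positions keep their original storage order.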

Handling Shape objects that have text

This is implemented in self.__text_frame_to_text and operates on pptx.text.text.TextFrame.

    def __text_frame_to_text(self, text_frame: TextFrame) -> str:
        """Parse a TextFrame.

        Args:
            text_frame (TextFrame): TextFrame inside the shape

        Returns:
            str: Extracted text
        """
        text_concat = ""
        for paragraph in text_frame.paragraphs:
            text = paragraph.text
            if text.strip() == "":
                continue
            text_concat += f"{text}\n"
        return text_concat

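TextFrame has a paragraphs attribute, and each paragraph is a separate pptx.text.text._Paragraph object, so the text of each paragraph is concatenated with newline separators.

Handling Shape objects that have a table

This is implemented in self.__table_to_text and operates on pptx.table.Table.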

    def __table_to_text(self, table: Table, col_num: int) -> str:
        """Parse a table.

        Args:
            table (Table): Table inside the shape
            col_num (int): Number of table columns (used to detect the end of a row)

        Returns:
            str: Extracted text
        """

        table_text_concat = ""
        for cell_index, cell in enumerate(table.iter_cells()):
            text = self.__text_frame_to_text(cell.text_frame).rstrip("\n")
            table_text_concat += text.replace("\n", " ")
            if cell_index % col_num == col_num - 1:
                table_text_concat += "\n"
            else:
                table_text_concat += ","
        return table_text_concat

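Table supports looping cell by cell with iter_cells. Columns are separated by commas and rows by newlines.

However, I could not find a way to detect the end of each row, so the number of columns is passed as an argument and used for that.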
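One alternative that avoids passing the column count is to iterate row by row: python-pptx tables expose rows, and each row exposes cells. The sketch below is duck-typed on table.rows, row.cells, and cell.text and uses the standard csv module so that commas inside a cell are quoted; it is tested here only against hypothetical stand-in objects, and table_to_csv and make_table are names of my own.

```python
import csv
import io
from types import SimpleNamespace


def table_to_csv(table) -> str:
    """Render a table row by row as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table.rows:
        # Collapse in-cell whitespace and line breaks so each row stays on one line.
        writer.writerow([" ".join(cell.text.split()) for cell in row.cells])
    return buffer.getvalue()


def make_table(rows):
    """Build a stand-in table (hypothetical, not a python-pptx object)."""
    return SimpleNamespace(rows=[
        SimpleNamespace(cells=[SimpleNamespace(text=t) for t in row])
        for row in rows
    ])


table = make_table([["item", "note"], ["A", "x, y"], ["B", "line1\nline2"]])
print(table_to_csv(table))
```

Quoting matters when the output is later split on commas, since a cell such as "x, y" would otherwise be read as two columns.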

Handling Shape objects that have an image file

This is implemented in self.__image_to_text and operates on pptx.parts.image.Image.

    def __image_to_text(self, image: PptxImage) -> str:
        """Parse an image file (English captions only).

        Args:
            image (PptxImage): Image object

        Returns:
            str: Extracted text
        """

        # get image "file" contents
        image_bytes: bytes = image.blob

        # temporarily save the image to feed into model
        image_filename = f"tmp_image.{image.ext}"

        with tempfile.TemporaryDirectory() as tmpdir:
            image_path = pathlib.Path(tmpdir).joinpath(image_filename)
            with open(image_path, "wb") as f:
                f.write(image_bytes)

            text = self.__caption_image(str(image_path))

        return text

    def __caption_image(self, tmp_image_file: str) -> str:
        """Generate text caption of image.

        Args:
            tmp_image_file (str): Path where the image file is saved

        Returns:
            str: Generated caption
        """

        import torch
        from PIL import Image

        device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model.to(device)

        max_length = 16
        num_beams = 4
        gen_kwargs = {"max_length": max_length, "num_beams": num_beams}

        i_image = Image.open(tmp_image_file)
        if i_image.mode != "RGB":
            i_image = i_image.convert(mode="RGB")

        pixel_values = self.feature_extractor(
            images=[i_image], return_tensors="pt"
        ).pixel_values
        pixel_values = pixel_values.to(device)

        output_ids = self.model.generate(pixel_values, **gen_kwargs)

        preds = self.tokenizer.batch_decode(output_ids, skip_special_tokens=True)
        return preds[0].strip()

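This mostly follows the LlamaIndex processing, so I omit the explanation; because the captioning model supports English, the resulting captions are also in English.

Trying it out

This time I use the slides from a talk I gave earlier as input.

Assuming the class is saved as power_point_parser.py, it can be run as follows.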


from power_point_parser import PowerPointParser, ParsedPresentation

import json
from dataclasses import asdict

parser = PowerPointParser(caption_images=True)

result: ParsedPresentation = parser.parse("2022-08-04_AKIBA.AWS_lookoutvision.pptx")

print(json.dumps(asdict(result), indent=2))

The result is a dataclass, so it can be converted to a dict with asdict.

The result, converted to JSON to make the dict easier to read, is shown below.

{
  "filename": "2022-08-04_AKIBA.AWS_lookoutvision.pptx",
  "caption_images": true,
  "num_slides": 21,
  "slide_width": 12192000,
  "slide_height": 6858000,
  "slides": [
    {
      "slide_number": 0,
      "shapes": [
        {
          "shape_type": "text",
          "text": "1\n",
          "left": 10748210,
          "top": 0,
          "width": 1443790,
          "height": 767700
        },
        {
          "shape_type": "text",
          "text": "Amazon Lookout for Vision\u3067\n\u7b46\u8de1\u9451\u5b9a\u3057\u3066\u307f\u305f\n",
          "left": 838200,
          "top": 3618465,
          "width": 10515599,
          "height": 602554
        },
        {
          "shape_type": "text",
          "text": "2022-08-04(\u6728)\n",
          "left": 838200,
          "top": 5401721,
          "width": 10515600,
          "height": 443651
        },
        {
          "shape_type": "text",
          "text": "\u30c7\u30fc\u30bf\u30a2\u30ca\u30ea\u30c6\u30a3\u30af\u30b9\u4e8b\u696d\u672c\u90e8\u3000\u4e2d\u6751\n",
          "left": 838200,
          "top": 5892501,
          "width": 10515600,
          "height": 328403
        }
      ]
    },
    {
      "slide_number": 1,
      "shapes": [
        {
          "shape_type": "text",
          "text": "\u81ea\u5df1\u7d39\u4ecb\n",
          "left": 0,
          "top": 0,
          "width": 10515600,
          "height": 767700
        },
        {
          "shape_type": "text",
          "text": "2\n",
          "left": 10748210,
          "top": 0,
          "width": 1443900,
          "height": 767700
        },
        {
          "shape_type": "text",
          "text": "\u25c6nakamura.shogo\n\u25c62022\u5e742\u6708\u5165\u793e\n\u25c6\u30c7\u30fc\u30bf\u30a2\u30ca\u30ea\u30c6\u30a3\u30af\u30b9\u4e8b\u696d\u672c\u90e8\u6a5f\u68b0\u5b66\u7fd2\u30c1\u30fc\u30e0\u6240\u5c5e\n\u25c6\u3084\u3063\u3066\u3044\u308b\u3053\u3068\uff1a\n\u6a5f\u68b0\u5b66\u7fd2\u6848\u4ef6\u306e\u5206\u6790\u30fb\u74b0\u5883\u69cb\u7bc9\u3001\u8ad6\u6587\u8aad\u307f\n\u6700\u8fd1\u8aad\u3093\u3060\u8ad6\u6587\u306fYOLOv7\u3067\u3001\u30d6\u30ed\u30b0\u306b\u3082\u3057\u3066\u307e\u3059\nhttps://dev.classmethod.jp/articles/yolov7-train-with-customize-dataset/\nhttps://dev.classmethod.jp/articles/yolov7-architecture-overall/\n",
          "left": 373025,
          "top": 1148275,
          "width": 11458800,
          "height": 5316900
        },
        {
          "shape_type": "image",
          "text": "a teddy bear sitting on top of a blue background",
          "left": 8736777,
          "top": 1148276,
          "width": 3095047,
          "height": 2914500
        },
        {
          "shape_type": "image",
          "text": "a blue and white sign on a blue and white sign",
          "left": 8736777,
          "top": 4169030,
          "width": 1017375,
          "height": 1017375
        },
        {
          "shape_type": "image",
          "text": "a blue and white sign on a blue and white sign",
          "left": 9693877,
          "top": 4169030,
          "width": 1017375,
          "height": 1017375
        }
      ]
    },
    {
      "slide_number": 2,
      "shapes": [
        {
          "shape_type": "text",
          "text": "\u4eca\u65e5\u306e\u304a\u8a71\n",
          "left": 0,
          "top": 0,
          "width": 10515600,
          "height": 767700
        },
        {
          "shape_type": "text",
          "text": "3\n",
          "left": 10748210,
          "top": 0,
          "width": 1443900,
          "height": 767700
        },
        {
          "shape_type": "text",
          "text": "\u25c6Amazon Lookout for Vision\u3068\u306f\n\u3000\u3000\u30fb\u6982\u8981\u3068\u4e3b\u306a\u30d5\u30ed\u30fc\n\u25c6\u81ea\u4f5c\u30c7\u30fc\u30bf\u30bb\u30c3\u30c8\u3067\u306e\u691c\u8a3c\n\u3000\u3000\u30fb\u7f72\u540d\u3092\u79c1\u81ea\u8eab\u304b\u305d\u308c\u4ee5\u5916\u304b\u3092\u5224\u5b9a\u3059\u308b\u30bf\u30b9\u30af\n\u25c6\u305d\u306e\u4ed6\u88dc\u8db3\u60c5\u5831\n\u3000\u3000\u30fb\u691c\u8a3c\u3092\u9032\u3081\u308b\u4e2d\u3067\u6c17\u3065\u3044\u305f\u70b9\u306a\u3069\u306a\u3069\n",
          "left": 373025,
          "top": 1148275,
          "width": 11458800,
          "height": 5316900
        },
        {
          "shape_type": "image",
          "text": "a white paper with a bunch of numbers on it",
          "left": 3622830,
          "top": 3333195,
          "width": 4946340,
          "height": 1674504
        }
      ]
    },
    // ~remaining slides omitted~
  ]
}
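The \uXXXX sequences in the output above come from json.dumps, which escapes non-ASCII characters by default (ensure_ascii=True). As a minimal sketch, passing ensure_ascii=False keeps Japanese text readable; DemoShape and DemoSlide are hypothetical, trimmed-down stand-ins for the dataclasses defined earlier.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class DemoShape:
    """Hypothetical, trimmed-down stand-in for ParsedShape."""
    shape_type: str
    text: str


@dataclass
class DemoSlide:
    """Hypothetical, trimmed-down stand-in for ParsedSlide."""
    slide_number: int
    shapes: List[DemoShape]


slide = DemoSlide(slide_number=1, shapes=[DemoShape("text", "自己紹介\n")])

# Default: non-ASCII characters are escaped as \uXXXX.
escaped = json.dumps(asdict(slide))
# ensure_ascii=False: characters are written as-is.
readable = json.dumps(asdict(slide), ensure_ascii=False, indent=2)

print(escaped)
print(readable)
```

Both strings decode to the same data; only the on-screen representation differs.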

To obtain plain text, build it from the dict data as follows.

plain_text = ""
for slide in asdict(result)["slides"]:
    plain_text += "-"*50 + "\n"
    plain_text += f"slide_number: #{slide['slide_number']}\n"
    plain_text += "-"*50 + "\n"
    for shape in slide["shapes"]:
        plain_text += f"{shape['shape_type']}:\n\n"
        plain_text += f"{shape['text']}\n"

print(plain_text)

This code produces the following text. Adjust it to your liking.

--------------------------------------------------
slide_number: #0
--------------------------------------------------
text:

1

text:

Amazon Lookout for Visionで
筆跡鑑定してみた

text:

2022-08-04(木)

text:

データアナリティクス事業本部　中村

--------------------------------------------------
slide_number: #1
--------------------------------------------------
text:

自己紹介

text:

2

text:

◆nakamura.shogo
◆2022年2月入社
◆データアナリティクス事業本部機械学習チーム所属
◆やっていること：
機械学習案件の分析・環境構築、論文読み
最近読んだ論文はYOLOv7で、ブログにもしてます

image:

a teddy bear sitting on top of a blue background
image:

a blue and white sign on a blue and white sign
image:

a blue and white sign on a blue and white sign

~remaining slides omitted~
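In the output above, consecutive image entries are not separated by a blank line, because the captions have no trailing newline while the text shapes end with one. One possible adjustment is sketched below: it strips trailing newlines and always emits one blank line after each shape. The function name slides_to_text and the sample data are my own.

```python
from typing import Dict, List


def slides_to_text(slides: List[Dict]) -> str:
    """Render slides as plain text with one blank line after each shape."""
    lines: List[str] = []
    for slide in slides:
        lines.append("-" * 50)
        lines.append(f"slide_number: #{slide['slide_number']}")
        lines.append("-" * 50)
        for shape in slide["shapes"]:
            lines.append(f"{shape['shape_type']}:")
            lines.append("")
            # Strip trailing newlines so text and image shapes are spaced alike.
            lines.append(shape["text"].rstrip("\n"))
            lines.append("")
    return "\n".join(lines)


# Sample data shaped like asdict(result)["slides"].
slides = [{
    "slide_number": 1,
    "shapes": [
        {"shape_type": "text", "text": "2\n"},
        {"shape_type": "image", "text": "a teddy bear sitting on top of a blue background"},
        {"shape_type": "image", "text": "a blue and white sign on a blue and white sign"},
    ],
}]
print(slides_to_text(slides))
```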

Summary

How was it? Writing this took longer than expected, but I think it covers fairly comprehensively how PowerPoint files behave when parsed with Python.

I hope this post helps anyone struggling to parse PowerPoint files.
